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SMOOTH OPTIMIZATION 
WITH APPROXIMATE GRADIENT 


ALEXANDRE D’ASPREMONT* 

Abstract. We show that the optimal complexity of Nesterov’s smooth first-order optimization 
algorithm is preserved when the gradient is only computed up to a small, uniformly bounded error. In 
applications of this method to semidefinite programs, this means in some instances computing only a 
few leading eigenvalues of the current iterate instead of a full matrix exponential, which significantly 
reduces the method’s computational cost. This also allows sparse problems to be solved efficiently 
using sparse maximum eigenvalue packages. 

1. Introduction. In [13] it was shown that smooth convex minimization prob¬ 
lems of the form: 

minimize fix) 
subject to X G Q, 

where / is a convex function with Lipschitz continuous gradient and Q is a sufficiently 
simple compact convex set, could be solved with a complexity of Oil/^/e), where e 
is the precision target. Furthermore, it can be shown that this complexity bound 
is optimal for that class of problems (see [14] for a dicussion). More recently, [15] 
showed that this method could be combined with a smoothing argument to produce 
an 0(l/e) complexity bound for non-smooth problems where the objective has a 
saddle-function format. In particular, this meant that a broad class of semidefinite 
optimization problems could be solved with significantly lower memory requirements 
than interior point methods and a better complexity bound than classic first order 
methods (bundle, subgradient, etc). 

Here, we show that substituting an approximate gradient, which may allow sig¬ 
nificant computation and storage savings, does not affect the optimal complexity of 
the algorithm in m- It is somewhat intuitive that an algorithm which exhibits good 
numerical performance in practice should be robust to at least some numerical error 
in the objective function and gradient computations since all implementations are 
necessarily computing these quantities up to some multiple of machine precision. Our 
objective here is to make that robustness explicit in order to design optimal schemes 
using only approximate gradient information. 

For non-smooth problems, when the objective function f{x) can be expressed as a 
saddle function on a compact set, the method in m starts by computing a smooth (i.e. 
with Lipschitz continuous gradient), uniform e-approximation of the objective function 
f{x), it then uses the smooth minimization algorithm in |13j to solve the approximate 
problem. When this smoothing technique is applied to semidefinite optimization, 
computing exact gradients requires forming a matrix exponential, which is often the 
dominant numerical step in the algorithm. 

Although there are many different methods for computing this matrix exponential 
(see m for a survey), their complexity is comparable to that of a full eigenvalue de¬ 
composition of the matrix. In problem instances where only a few leading eigenvalues 
suffice to approximate this exponential, the per iteration complexity of the algorithm 
described here becomes comparable to that of classical first-order methods such as the 
bundle method (see i) or subgradient methods (see m for example), which have a 
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global complexity bound of 0(l/e^) (see [2]), while keeping the optimal complexity 
of 0(l/e) of the algorithm in [T5] . 

We apply this result to a maximum eigenvalue minimization problem (or semidef- 
inite program with constant trace). We first recall the complexity bound derived in 
[l6] based on a smoothing argument, using exact gradients. We produce a rough 
theoretical estimate of the number of eigenvalues required for convergence when ap¬ 
proximate gradients are used. We then derive an explicit condition on the quality of 
the gradient approximation to guarantee convergence and compute a bound on the 
number of iterations. We show both on randomly generated problem instances and 
on problems generated from biological data sets that actual computational savings 
vary significantly with problem structure but can be substantial in some cases. 

The paper is organized as follows. In the next section, we prove convergence of 
the algorithm in [13| when only an approximate gradient is used. In Section [3] we 
describe how these results can be applied to semidefinite optimization. Finally, in 
the last section we test their performance on semidefinite relaxation and maximum 
eigenvalue minimization problems. 

2. Smooth optimization with approximate gradient. Following the results 
and notations in [151 §3]) we study the problem: 

_ minimize f{x) 

' ' ' subject to X £ Q, 

where Q C R" is a compact convex set and / is a convex function with Lipschitz 
continuous gradient, such that: 

l|V/(a:)-V/(y)|r <L||x-2/||, x,yeQ, 

for some L > 0, which also means: 

(2.2) f{y) < f{x) + {Vf{x),y - x)+ ^L\\y - x\\^, x,yGQ. 

The key difference here is that the oracle information we obtain for V/ is noisy. 
Note that the function values are not required to compute iterates in the algorithm 
described here, so even if our knowledge of function values f{x) is noisy, we will 
always use exact values in the proofs that follow. At each iteration, we obtain V/(a;) 
satisfying: 

(2.3) \{Vf{x)-\'f{x),y - z)\ < S x,y,zGQ, 

for some precision level 5 > 0. Throughout the paper, we assume that Q is simple 
enough so that this condition can be checked efficiently. As in m, we also assume 
that certain projection operators on Q can be computed efficiently and we refer the 
reader to the end of this section for more details. Here, d{x) is a prox-function for 
the set Q, i.e. continuous and strongly convex on Q with parameter cr (see [H] or |7] 
for a discussion of regularization techniques using strongly convex functions). We let 
xq be the center of Q for the prox-function d(x) so that: 

Xq = argmin(i(a;), 

x^Q 

assuming w.l.o.g. that di^x^) = 0, we then have: 

(2.4) d{x)>]^<T\\x-Xo\\'^. 
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We denote by Tq{x) a solution to the following subproblem: 

(2.5) Tq{x) = aigTohii {V f{x),y - x) + \-L\\y - xW"^ 

yeQ L ^ 

We let 2/0 = Tq{xo) where xq is defined above. We recursively define three sequences 
of points: the current iterate {xk\, the corresponding yj- = Tg^Xk), together with 

(2.6) Zk = argmin'l —d{x) +y^ a^[f{x^) + {Vf{x^),x- Xi)] 
and a step size sequence {ak} > 0 with ao S (0,1] so that 


/n Y". “t“ (1 '^k')yk 

yk +1 = fQ{xk+i) 

where Tk = at+i/Ak+i with Ak = We implicitly assume here that the 

two subproblems defining yk and Zk can be solved very efficiently (in the examples 
that follow, they amount to Euclidean projections). We will show recursively that 
for a good choice of step sequence a/c, the iterates Xk and yk satisfy the following 
relationship (denoted by TZk)'- 


Akf{yk)<i’k + Akg{k,6) (TZk) 


where g{k, 6) measures the accumulated gradient approximation error and will be 
bounded in Lemma ITTl and 


2 /jfe = min < —d{x) + ai[f{xi) + (Vfixi), x - Xi)\ \ . 

J 

First, using d(x) > ^cr||a: — xo|P, then inequality (12.2() and condition ()2.3p . we have: 

ipo = mini—d(x) +ao[f{xo) + (V/(a;o), a: - xq)] 1 > ao/(yo) - aoS 

xeQ p (T J 

which is TZo- We can then bound the approximation error in the following result. 
Lemma 2.1. Let ak be a step sequence satisfying: 

(2.8) 0 < ao < 1 and a\ < Ak, k >0, 

suppose that (TZk) holds with Xk+i and yk+i are defined as in d^. 7| ), then (TZk+i) holds 
with: 


g{k + 1, (5) = (1 - Tk)g{k, 5) + 
where Tk G [0,1] and g{0, S) = ao6. 

Proof. Let us assume that {TZk) holds. Because d{x) is strongly convex with 
parameter a, the function: 

L ^ ^ 

—d{x) + ^ ai[f{xi) + (V/(a:j), x - Xi)] 

^ i=0 
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is strongly convex with parameter L. Using this property and the definition of Zk we 
obtain: 


[ 2 , '=+1 _ 'I 

V'fe+i = min -d{x) + 'V' ai[f{x^) + {Vfixi), x - Xi)] ) 

x£Q a 

I ) 


x^Q 


> min ipk + -zLWx - Zk\\'^ + ak+i[f{xk+i) + {Wf{xk+i),x- Xk+i)] > • 


Now, using {TZk) then the convexity of f{x), we get: 

V'/c + Akg{k,5) + ak+i[f{xk+i) + f{xk+i),x - Xk+i)] 

> Akfiuk) + ak+i[f{xk+i) + {Vf(xk+i),x - Xk+i)] 

> Aklfixk) + {Vf{xk+i),yk - Xk+i)] + ak+i[fixk+i) + {Vf{xk+i),x- x^+i)], 
and condition (I2.3I1 . together with (EH imply that: 

Aklfixk) + {Vf{xk+i),yk - Xk+i)] + afc+i[/(xfc+i) + (V/(xfe+i),x - x^+i)] 

> Ak+if{xk+i) + {Vf{xk+i),Akyk - AkXk+i + ak+i{x - Xk+i)) - ak+iS 

= Ak+lf{xk+l) + Q!fe+i(V/(x/c+l),X - Zk) - ak+ 16 . 

Because ak satisfies (j2.8L we have and can combine the last three inequal¬ 

ities to get: 

l/'fe+i > Ak+if{xk+i) - Akgik^S) - ak+iS 

^2 9) -hmina;gQ {^L\\x - -h Ofe+i(V/(xfc+i),x - Zk)} 

> Ak+i[fixk+i)-il-Tk)g{k,S)-TkS 

-|-mina;gQ {iLr^llx - ZfclP + rfe(V/(x/c+i), x - Zk)}]- 
Let us define y = TkX -f (1 — Tk)yk so that y — Xk+i = Tk{x — Zk), with: 

mina;gQ {iLr|||x - Zfe|p -|- rfc(V/(xfe+i), x - Zk)} 

= min{yg^,Q+(i_^,)j^,} {\L\\y - Xfc+ip -(- (V/(xfe+i), y - x^+i)} . 

Combining condition (|2.3I) with the fact that y — Xk+i = Tk{x — Zk) for some x,Zk G Q, 
we get: 

min{y 6 r,Q-H(l-rfc)j/fc} “ Xfc +1 f -f (V/(Xfc+l), ?/ - Xfe+i)} 

— '^in-|ygT-j,Q+(i— 2~ Xk-\-i II -l- (V/(xfe_|_i), y — Xk+i) j" ~ 

Now, because Q is convex, we must have XkQ -I- (1 — Tk)yk C Q and: 

inin{yer,Q+(i-r,)yU {\L\\y - Xk+iV + (V/(xfe+i), y - Xfe+i) | - TkS 
> min^gQ |iL|lj/- Xfc+i|p -h (V/(xfe+i),y - Xfe+i)| - TkS. 
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By the definition of Uk+i = TQ(xk+i) and using condition (12.3|) . we get: 


minj/gQ |iL||?/ - Xk+iW^ + {Vf{xk+i),y- Xfe+i)| - TkS 
= iL||fQ(xfc+i) - Xk+iW^ + (V/(xfe+i),fQ(xfe+i) - Xfc+i) - TkS 

> ^L\\yk+i - Xfe+i|p + (V/(xfe+i),y/c+i - Xfc+i) - 2Tfe(5, 

and inequality (j2.2ll gives: 

iL||j/fc+i - Xk+iW^ + (V/(xfe+i), yk+i - Xk+i) - 2Tkd, 

> fiVk+l) - f{xk+l) - 2TkS. 

Combining these inequalities with the inequality on t/jk+i in (EH), we finally get: 

i’k+i > Ak+i [f{yk+i) - (1 - Tk)g{k,6) - 3rfc(5] 
which is the desired result. □ 

We can use this result to study the convergence of the following algorithm given 
only approximate gradient information. 


Smooth minimization with approximate gradient. 

Starting from xq, the prox center of the set Q, we iterate: 

1. compute Vf{xk), 

2. compute yk = fQ{xk), 

3. compute Zk = argmin^^gg |^(i(x) + J2i=o + (V/(xj),x - Xi)]|, 

4. update x using Xk+i = TkZk + (1 - Tk)yk, 


Again, because solving for yk and Zk can often be done very efficiently, the dom¬ 
inant numerical step in this algorithm is the evaluation of V/(xfe). If the step size 
sequence ak satisfies the conditions of Lemma l2.11 we can show the following conver¬ 
gence result: 

Theorem 2.2. Suppose ak satisfies equation i2.8\} . with the iterates Xk and yk 
defined in \2.6\) and then for any k > 0 we have: 

f{yk)-f{x*)<^^^ + ^S 
Akcr 

where x* is an optimal solution to problem EH- 

Proof. If ak satisfies the hypotheses of lemma ETTI we have: 

Akfiyk) < V’fe + ^kdik, S) 

where Ak = because /(x) is convex, we also have: 

'fik < —d{x*) + Akf{x*) + Ak'SS 
a 

which yields the desired result. □ 
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When d{x*) < +oo (e.g. if Q is bounded), if we set the step sequence as = 
{k + l)/2 and 6 to some fraction of the target precision e (here e/6), grows as 
0{k‘^) and Theorem 12.21 ensures that the algorithm will converge to an e solution in 
less than: 


( 2 . 10 ) 


8Ld{x*) 


iterations. In practice of course, d{x*) needs to be bounded a priori and L is often 
hard to evaluate. A notable exception is when f{x) is a smooth approximation (as in 
[T5l[T6] for example), in which case L is know explicitly as a function of the precision. 
We have implicitly assumed, as in m, that the set Q is simple enough so that 
the complexity of solving the two minimization subproblems in steps 2 and 3 of the 
algorithm is low relative to that of approximating the gradient. We also implicitly 
assumed that the set Q is simple enough so that condition can be checked 

efficiently. In the numerical experiments of Section [?] for example, steps 2 and 3 are 
Euclidean projections on the unit box and condition (12.31) is a simple inequality on 
the leading eigenvalues of the current iterate. 


3. Semidefinite optimization. Here, we describe in detail how the results of 
the previous section can be applied to semidefinite optimization. We consider the 
following maximum eigenvalue problem: 

- . minimize A™“(A^?/ + c) — b^y 

^ ' subject to y & Q, 


in the variable y G R'", with parameters A G ^ b G R™ and c G R” . Let 

us remark that when Q is equal to R"*, the dual of this program is a semidefinite 
program with constant trace written: 

maximize (Fx 
subject to Ax = b 

Tr(x) = 1 
a; ^ 0, 

2 

in the variable x G R” , where Tr(x) = 1 means that the matrix obtained by reshap¬ 
ing the vector x has trace equal to one and x y 0 means that this same matrix is 
symmetric, positive semidefinite. 

3.1. Smoothing technique. As in m, [is]> m or [5] we can find a uniform 
e-approximation to A™“(A) with Lipschitz continuous gradient. Let y > 0 and 
A G S„, we define: 


/„(X) =Mlog fee'-™-) = flog (‘■"S'" 

where Ai(A) is the eigenvalue of X. This is also: 

(3.3) f^{X) = A”“(A) + MlogTr (^exp ~ 

which requires computing a matrix exponential at a numerical cost of 0(n^). We then 
have: 




Amax(^) < < A““(A) + y\ogn, 
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so if we set fi = e/logn, f^{X) becomes a uniform e-approximation of A™“(X). In 
m it was shown that f^{X) has a Lipschitz continuous gradient with constant: 

^ ^ _ logn 

e 

The gradient Vffi{X) can also be computed explicitly as: 


(3.4) 


exp(^^a^) 

'(exp(- JT^)) 


using the same matrix exponential as in (13.31) . Let ||y|| be some norm on R"* and 
d{x) a strongly convex prox-function with parameter cr > 0. As in m, we define: 

Pll = max ||A^/i||2, 


where || A^/i ||2 = max, |Ai(A^h)|. The algorithm detailed in where exact function 
values and gradients are computed, will find an e solution to 63 ) after at most: 

(3.5) 

e V cr 

iterations, each iteration requiring a matrix exponential computation. 


3.2. Spectrum & expected performance gains. The complexity estimate 
above is valid when the matrix exponential in (13.31) is computed exactly, at a cost of 
O(n^). As we will see below, only a few leading eigenvalues are sometimes required to 
satisfy conditions (E3 and obtain a comparable complexity estimate at a much lower 
numerical cost. To illustrate the potential complexity gains, let us pick a matrix 
A G S„ whose coefficients are centered independent normal variables with second 
moment given by cr^/n. From Wigner’s semicircle law, A™'^’'(A) ^ 2cr as n goes 
to infinity and the eigenvalues of X are asymptotically distributed according to the 
density: 


p(x) = ^;^\/4cr2 - x^, 

which means that, in the limit, the proportion of eigenvalues required to reach a 
precision of 7 in the exponential is given by: 


Pa = P 


e 



x'^dx. 


Since the problems under consideration are relaxations of sparse PCA, we can also 
consider the case where A G S„ is sampled from the Wishart distribution. In that 
case, the eigenvalues are distributed according to the Marcenko-Pastur distribution 
(see [To]) and the above proportion becomes: 


Px=P 


( Ai(X)-X‘°»^(X) 

e 




yj a:(4cr — a;) 
27rx 


dx. 


With n = 5000, 7 = I0“® and e = 10“^, we get nP\ = 2.3, so the approxima¬ 
tions above would suggest that, in theory, it is only necessary to compute about 
three eigenvalues per iteration to get an approximation with precision 7 = I0“®. In 
practice however, the results of Section [H show that these rough estimates should be 
significantly tempered. 
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3.3. Global complexity bound. Let us now focus on the following program: 


(3.6) 


minimize A™“(A^j/ + c) 
subject to ||y|| < /3, 


where the set Q is here explicitly given by : 

Q = {yGRP-. 


<P}: 


for some (3 > 0 with ||.|j the Euclidean norm here. We can pick ||a:|p/2 as a prox 
function for Q, which is strongly convex with convexity parameter 1. Let X{X) G R” 
be the eigenvalues of the matrix X = A^y + c, in decreasing order, with Ui{X) G R" 
an orthonormal set of eigenvectors. The gradient matrix of exjp{X/y,) is written: 


V^(X) = 



y e—u,{X)u,iX f 


Suppose we only compute the first m eigenvalues and use them to approximate this 
gradient by: 



= 

we get the following bound on the error: 

||V/^(X)-V^(X)|| < 


y e Ui{X)ui{X)^ 


'/2{n — m)e 


Am(X)-Ai(X) 


(Er=. 


Xi(X)-Ai(X) 


) ■ 


In this case, with X = A^y — c here, condition (12.3p means that we only need to 
compute m eigenvalues with m such that: 


(3.7) 


'/2{n — m)e 


^m(X)-Xi(X) 


(sr.! 


XaX)-Ai(X) \ ■ — 




where (t™“(A) is the largest singular value of the matrix A. Using the result in [T6] . 
if we define ||A|j = max||/i||^i ||A/i ||2 and set 5 = e/6, the algorithm in Sectionwill 
then converge to an e-solution of problem (13.6p in at most: 


(3.8) 


4PII/3 




iterations. This bound on the number of iterations is independent of m in condition 
dSH), i.e. the number of eigenvalues required at each iteration. The cost per itera¬ 
tion however varies with problem structure as each iteration requires computing m 
leading eigenvalues, which can be performed in OimA) operations. Note that par¬ 
tial eigenvalue decompositions only access the matrix through matrix-vector products 
(see 0), hence can handle sparse problems very efficiently. The threshold <5 can be 
adjusted empirically to tradeoff between the number of iterations and the numerical 
cost of each iteration. Unfortunately, we can’t directly infer a bound on m from the 
structure of A, so in the next section we study the link between m and the matrix 
spectrum in numerical examples. 















4. Examples & numerical performance. In this section, we illustrate the 
behavior of the approximate gradient algorithm on various semidefinite optimization 
problems. Overall, while there appears to be a direct link between problem structure 
and complexity (i.e. the number of eigenvalues required in the gradient approxi¬ 
mation) in the first sparse PGA example discussed below, we will observe on ran¬ 
dom maximum eigenvalue minimization problems that predicting complexity based 
on overall problem structure remains an open numerical question in general. 

4.1. Sparse principal component analysis. Based on the results in [3], the 
problem of finding a sparse leading eigenvector of a matrix C G S„ can be written: 

maximize x^Cx 

(4.1) subject to ||a :||2 = 1 

Card(a;) < fc, 

where Card(a:) is the number of nonzero coefficients in x, and admits the following 
semidefinite relaxation: 

maximize Tr(C'Al) — pl^| ATj! 

(4.2) subject to Tr(Ar) = 1 

a: h 0 , 

which is a semidefinite program in the variable € S”, where p > 0 is the penalty 
controlling the sparsity of the solution. Its dual is given by: 

. minimize + U) 

' ' subject to \Uij\<p, f,j = l,...,n, 

which is of the form (EH) with 

Q = {17 G S„ : \Uij\< p, i,j = 1,... ,n} . 

The smooth algorithm detailed in Section [5] is explicitly described for this problem 
in [3] and implemented in a numerical package called DSPCA which we have used in 
the examples here. To test its performance, we generate a matrix M with uniformly 
distributed coefficients in [0,1]. We let e G be a sparse vector with: 

e = (1,0,1,0,1,0,1,0,1,0,0,0 ,...). 

We then form a test matrix C = M'^M + vee^, where u is a signal-to-noise ratio. 

In Figure SH on the left, we plot duality gap versus CPU time used for values of 
the signal to noise ratio v ranging from 10 to 100. In Figure HTT] on the right, we plot 
number of eigenvalues required against computing time using a covariance matrix of 
dimension n = 500 sampled from the colon cancer data set in [T], and a noisy rank 
one matrix. Finally, we measure total computing time versus problem dimension n on 
this same data set, by solving problem (14.21) for increasingly large submatrices of the 
original covariance matrix. In each of these examples, we stop after the duality gap has 
been reduced by 10 “^, which is enough here to identify sparse principal components. 
In Figure 14.21 on the left, we plot computing time versus target precision in loglog 
scale, on a sparse PCA problem of size 200 extracted from the colon cancer data set. 
In the previous section, we have seen that precision impacts computing time both 
through the total number of iterations in (13.51) and through condition ()3.7I) on the 
number of eigenvalues required in the gradient approximation. In this example, we 
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Fig. 4.1. Left: Duality gap versus CPU time for various values of the signal to noise ra¬ 
tio V. Right: Percentage of eigenvalues required versus CPU time, for various values of the penalty 
parameter p controlling sparsity. 
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Precision target 


Fig. 4.2. Left: CPU time (in seconds) versus target precision in loglog scale. Right: average 
percentage of eigenvalues required at each iteration versus target precision, in semilog scale. 


observe that CPU time increases a little bit slower than the upper bound of 0{\/e) 
given in (lU). In Figure ir^ on the right, we plot the average percentage of eigenvalues 
required at each iteration versus target precision, in semilog scale. We observe, on 
this example of dimension 200, that for low target precisions, one eigenvalue is often 
enough to approximate the gradient, but that this number quickly increases for higher 
precision targets. Note that in all cases, the precision targets are significantly lower 
than those achieved by interior point methods (usually at least 10“®) but the cost per 
iteration and storage requirements of the first-order algorithms detailed here are also 
significantly lower. 

In Table 14.11 we then compare total CPU time using a full precision matrix 
exponential, against CPU time using only a partial eigenvalue decomposition to ap¬ 
proximate this exponential. Note that other classic methods for computing the matrix 
exponential such as Pade approximations (see m): did not provide a significant per¬ 
formance benefit and are not included here. Both exact and approximate gradient 
codes are fully written in C, with partial eigenvalue decompositions computed using 
the FORTRAN package ARPACK (see [8]) with calls to vendor optimized BLAS and 
LAPACK for matrix operations. To improve stability, the size of the Lanczos basis 
in ARPACK was set at four times the number of eigenvectors required. We observe 
that, on these problems, the partial eigenvalue decomposition method is about ten 
times faster. 
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n = 100 

n = 200 

n = 500 

Rank one, Full 

3.2 

8.0 

14.7 

Rank one. Partial 

0.4 

0.75 

1.6 

Golon, Full 

2.6 

18.1 

274.3 

Colon, Partial 

0.3 

1.3 

17.7 


Table 4.1 

CPU time (in seconds) versus problem dimension n for full and partial eigenvalue matrix 
exponential computations. 


4.2. Matrix structure and complexity: open numerical issues. The pre¬ 
vious section showed how the spectrum of the current iterate impacts the complexity 
of the algorithm detailed in this paper: a steeply decreasing spectrum allows fewer 
eigenvalues to be computed in the matrix exponential approximation, and a wider 
gap between eigenvalues improves the convergence rate of these eigenvalue compu¬ 
tations. In this section, we study the number of eigenvalues required in randomly 
generated maximum eigenvalue minimization problems. Because of the measure con¬ 
centration phenomenon, there is nothing really random about the spectrum of large- 
scale, naively generated semidefinite optimization problems so we begin by detailing 
a simple method for generating random matrices with a given spectrum. 

Generating random matrices with a given spectrum. Suppose X £ S„ is a matrix 
with normally distributed coefficients, Xij ^ Af{0, 1), i, j = 1,... ,n. If we write its 
QR decomposition, X = QR with Q, R G R”^", then the orthogonal matrix Q is 
Haar distributed on the orthogonal group On (see [4] for example). This means that 
to generate a random matrix with given spectrum A £ R”, we generate a normally 
distributed matrix X, compute its QR decomposition and the matrix Qdiag(A)Q^ 
will be uniformly distributed on the set of symmetric matrices with spectrum A. 

Maximum eigenvalue minimization. We now form random maximum eigenvalue 
minimization problems, then study how the number of required eigenvalues in the 
gradient computation evolves as the solution approaches optimality. We solve the 
following problem: 

minimize X^^^^A^y + c) 
subject to ||y|| < /3, 

in the variable y £ R"*, where c £ R" , A £ and /3 > 0 is an upper bound 

on the norm of the solution. In Figure IT751 we plot percentage of eigenvalues required 
in the gradient computation versus duality gap for randomly generated problem in¬ 
stances where n = 50 and m = 25. The first two plots use data matrices with Gaus¬ 
sian and Wishart distributions, whose spectrum are distributed according to Wigner’s 
semicircle law and the Marcenko-Pastur distribution respectively. The last two plots 
use the procedure described above to generate matrices with uniform spectrum on 
[0,1], and uniform spectrum on [0,1] with one eigenvalue set to 5. We observe that 
the number of eigenvalues required in the algorithm varies signihcantly with matrix 
spectrum. 

Problem structure and effective complexity. The results on sparse PGA in M.ll 
and on the random problems of this section show that problem structure has a sig¬ 
nificant impact on performance. Predicting how many eigenvalues will be required 
at each iteration based on structural properties of the problem is an important but 
difficult question. In particular, the number of eigenvalues required in the Gaus¬ 
sian case is much higher than what the asymptotic analysis in Section [3.21 predicted. 
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Furthermore, in the sparse PCA example, complexity seems to vary with problem 
structure somewhat intuitively: higher signal to noise ratio means lower complexity 
and a higher sparsity target means higher complexity. However, this is not the case in 
the random problems studied here, two unstructured problems (uniform and Wishart) 
have low complexity while one requires computing many more eigenvalues per itera¬ 
tion (Gaussian) and a more structured example (uniform plus rank one) also requires 
many eigenvalues. Overall then, predicting effective complexity (i.e. the number of 
eigenvalues required at each iteration) based on problem structure remains a difficult 
open question at this point. 

Also, it is well known empirically (see and [TB] among others) that the 

largest eigenvalues of A^y — c in (EH) tend to coalesce near the optimum, thus poten¬ 
tially increasing the number of eigenvalues required when computing f(x) and the 
number of iterations required for computing leading eigenvalues (see [5] for example), 
but in these references too, no a priori link between coalescence and problem struc¬ 
ture is established. This coalescence phenomenon is never apparent in the numerical 
examples studied here, perhaps because it only appears at the much higher precision 
targets reached by interior point methods. 


Gaussian Wishart 



Uniform 


Uniform +1 




Fig. 4.3. Average percentage of eigenvalues required (solid line) versus duality gap on randomly 
generated maximum eigenvalue minimization problems, for various problem matrix distributions. 
Dashed lines at plus and minus one standard deviation. 
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